1. RMarkdown basics

Anscombe’s Quartet of ‘Identical’ Simple Linear Regressions

Description

Four \(x\)-\(y\) datasets which have the same traditional statistical properties (mean, variance, correlation, regression line, etc.), yet are quite different.

Usage

anscombe

Format

A data frame with 11 observations on the following 8 variables.

x1 == x2 == x3 the integers 4:14, specially arranged

x4 values 8 and 19

y1, y2, y3, y4 numbers in (3, 12.5) with mean 7.5 and sdev 2.03
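
The claims in the Description and Format can be checked directly. The following is a minimal sketch that uses only base R; the object names (`x_same`, `y_means`, and so on) are illustrative and not part of the original help page.

```r
# anscombe ships with base R (package 'datasets'), so no extra packages are needed.
x_same  <- identical(anscombe$x1, anscombe$x2) &&
  identical(anscombe$x2, anscombe$x3)        # are x1, x2 and x3 the same vector?
x_range <- sort(anscombe$x1)                 # expected: the integers 4 to 14
x4_vals <- sort(unique(anscombe$x4))         # expected: 8 and 19

ys      <- anscombe[paste0("y", 1:4)]
y_means <- round(sapply(ys, mean), 2)        # mean of each y column
y_sds   <- round(sapply(ys, sd), 2)          # standard deviation of each y column

# Correlation of each (x_i, y_i) pair
xy_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

print(list(x_same = x_same, x4_vals = x4_vals,
           y_means = y_means, y_sds = y_sds, xy_cor = round(xy_cor, 3)))
```

If the four pairs really share their summary statistics, the printed means, standard deviations, and correlations should agree across columns to about two or three decimals.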

Source

Tufte, Edward R. (1989). The Visual Display of Quantitative Information, 13–14. Graphics Press.

References

Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21. doi:10.2307/2682899.

Examples

require(stats)
require(graphics)
require(knitr)
## Loading required package: knitr
summary(anscombe)
##        x1             x2             x3             x4           y1        
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8   Min.   : 4.260  
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8   1st Qu.: 6.315  
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8   Median : 7.580  
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9   Mean   : 7.501  
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8   3rd Qu.: 8.570  
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19   Max.   :10.840  
##        y2              y3              y4        
##  Min.   :3.100   Min.   : 5.39   Min.   : 5.250  
##  1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170  
##  Median :8.140   Median : 7.11   Median : 7.040  
##  Mean   :7.501   Mean   : 7.50   Mean   : 7.501  
##  3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190  
##  Max.   :9.260   Max.   :12.74   Max.   :12.500

Now some “magic” to do the four regressions in a loop:

ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
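  ## print() is required inside a loop; set the chunk option results = 'asis'
  ## so that the kable tables render as tables
  print(kable(anova(lmi)))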
  cat('\n')
}
            Df    Sum Sq    Mean Sq   F value     Pr(>F)
x1           1  27.51000  27.510001  17.98994  0.0021696
Residuals    9  13.76269   1.529188        NA         NA

            Df    Sum Sq    Mean Sq   F value     Pr(>F)
x2           1  27.50000  27.500000  17.96565  0.0021788
Residuals    9  13.77629   1.530699        NA         NA

            Df    Sum Sq    Mean Sq   F value     Pr(>F)
x3           1  27.47001  27.470008  17.97228  0.0021763
Residuals    9  13.75619   1.528466        NA         NA

            Df    Sum Sq    Mean Sq   F value     Pr(>F)
x4           1  27.49000  27.490001  18.00329  0.0021646
Residuals    9  13.74249   1.526943        NA         NA
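
The loop above rewrites the left- and right-hand sides of one formula object in place. A possibly more readable alternative is to build each formula from strings with `reformulate()` from the stats package. This is a sketch, not the help page's method, and the names `fit_pair` and `mods_alt` are made up for illustration.

```r
# Build "y_i ~ x_i" from variable names and fit it; i is the data set index 1..4.
fit_pair <- function(i) {
  f <- reformulate(paste0("x", i), response = paste0("y", i))
  lm(f, data = anscombe)
}

# One fitted model per data set, named like the list in the original loop.
mods_alt <- setNames(lapply(1:4, fit_pair), paste0("lm", 1:4))

# Coefficients of the first model, for comparison with a hand-written formula.
coef(mods_alt$lm1)
```

The trade-off is mostly one of taste: the in-place edit reuses a single formula object, whereas `reformulate()` makes the intended formula explicit at the cost of string handling.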

See how close they are (numerically!)

sapply(mods, coef)
##                   lm1      lm2       lm3       lm4
## (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
## x1          0.5000909 0.500000 0.4997273 0.4999091
lapply(mods, function(fm) coef(summary(fm)))
## $lm1
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0000909  1.1247468 2.667348 0.025734051
## x1          0.5000909  0.1179055 4.241455 0.002169629
## 
## $lm2
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.000909  1.1253024 2.666758 0.025758941
## x2          0.500000  0.1179637 4.238590 0.002178816
## 
## $lm3
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0024545  1.1244812 2.670080 0.025619109
## x3          0.4997273  0.1178777 4.239372 0.002176305
## 
## $lm4
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0017273  1.1239211 2.670763 0.025590425
## x4          0.4999091  0.1178189 4.243028 0.002164602
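
To put a number on “how close”, one option is to measure the spread of each coefficient across the four fits. The sketch below refits the models so that it runs on its own; `fits`, `cf`, `spread`, and `r2` are illustrative names.

```r
# Refit the four regressions y_i ~ x_i so this block is self-contained.
fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})

# 2 x 4 matrix of coefficients: one column per data set.
cf <- sapply(fits, function(fm) unname(coef(fm)))
rownames(cf) <- c("intercept", "slope")

# Largest minus smallest value of each coefficient across the four fits.
spread <- apply(cf, 1, function(v) diff(range(v)))

# R-squared of each fit, another summary that is expected to be nearly identical.
r2 <- sapply(fits, function(fm) summary(fm)$r.squared)

print(spread)
print(round(r2, 3))
```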

Now, do what you should have done in the first place: PLOTS

op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma =  c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)

par(op)

Just for fun, place the “Monstrous Costs” figure from Healy here (either find an image online, or print screen and crop). Make sure to align the figure to center, add a caption, and have the width set to 40%.
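
One way to meet these requirements is to include a saved image with `knitr::include_graphics()` and set the layout through chunk options. The sketch below assumes knitr 1.35 or later, which accepts options as `#|` comments at the top of a chunk; the same options can instead go in the chunk header, for example `{r healy-fig, echo=FALSE, fig.align='center', out.width='40%', fig.cap='...'}`. The file name and caption text are hypothetical placeholders.

```r
#| echo: false
#| fig.align: "center"
#| out.width: "40%"
#| fig.cap: "The 'Monstrous Costs' chart, as reproduced in Healy."

# "monstrous_costs.png" is a hypothetical file name; point it at the saved image.
knitr::include_graphics("monstrous_costs.png")
```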

2. Analyze the gapminder interactive plot from the introduction.

Question 2.1

Place here the interactive animated plot from the introduction (you’d need to install and call the libraries gapminder, ggplot2 and plotly for it to work). Use the echo = FALSE option to hide the code.
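
A minimal sketch of such a chunk follows. It assumes the plot from the introduction is the usual gapminder bubble chart (GDP per capita against life expectancy, animated by year); the exact aesthetics used there are not shown in this document, so treat the mapping below as a guess. The chunk that holds it would carry the option `echo = FALSE`.

```r
library(gapminder)  # the data
library(ggplot2)    # static plot
library(plotly)     # conversion to an interactive, animated widget

# 'frame' and 'ids' are not ggplot2 aesthetics (ggplot2 warns about them);
# ggplotly() uses them to build the animation by year and to track countries.
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(aes(size = pop, frame = year, ids = country)) +
  scale_x_log10()

fig <- ggplotly(p)
fig
```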

Question 2.2

Without hovering over the markers to show the data that is associated with them, identify a marker that captures your attention (from one of the years). Using the notion of “preattentive search”, try to understand and explain in writing why this particular marker caught your attention. Identify the country that is associated with this marker. Were you surprised? Have you learned something that you didn’t know? Affirmed an intuition? Repeat this exercise, this time with a marker that captures your attention from the animated sequence.

Static Analysis: The marker that captures my attention is the one with the highest GDP per capita in 1952. It sits alone at the far right of the plot, around mid-height, with no other marker near it. In terms of preattentive search, it “pops out” because its position isolates it from the main cloud of markers, so it is found without any deliberate scanning. The country associated with this marker is Kuwait. I was surprised, because I did not know that Kuwait had such a high GDP per capita in 1952. The value can be attributed to the discovery and exploitation of its vast oil reserves. Kuwait has some of the largest oil reserves in the world, and the oil industry began to significantly affect its economy in the late 1940s and early 1950s.
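
The identification can be confirmed from the data rather than by hovering. This is a small sketch that assumes the gapminder package is installed; `g1952` and `top` are illustrative names.

```r
library(gapminder)

# Keep only the 1952 observations (one row per country).
g1952 <- subset(gapminder, year == 1952)

# Row with the largest GDP per capita in that year.
top <- g1952[which.max(g1952$gdpPercap), ]

as.character(top$country)
```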

Animated Sequence Analysis: During the animation, China’s marker catches my attention because it shows dramatic improvement in both GDP per capita and life expectancy over time. This suggests that the country has experienced rapid development and rising living standards. Observing such a transformation highlights the impact of economic development and policy choices on health and well-being.

Question 2.3

For four of the seven “gestalt rules” of your choice that are enumerated on page 22, provide an example of the principle in practice in the gapminder plot.

  • Proximity: Things that are spatially near to one another seem to be related.

  • Closure: Incomplete shapes are perceived as complete.

  • Figure and ground: Visual elements are taken to be either in the foreground or in the background.

  • Common fate: Elements sharing a direction of movement are perceived as a unit.

Question 2.4

Starting to think: one thing we mentioned is that the gapminder plot does not highlight “inequality” very well. Visualizing inequalities entails visualizing distributions. Suggest a tentative method for highlighting aspects of inequality in these data that you find important. You may look online, refer to the diamonds app from the introduction, or use any other source. You are not asked to provide any plots here; this is a teaser thought experiment (you may provide examples of visualizations you find relevant). We shall discuss visualization tools for comparing distributions at length throughout the class.

To visualize inequality in the context of the Gapminder data, we need ways to represent the distribution of these metrics within each country, rather than just an average or a single data point per country. Several methods could highlight aspects of inequality:

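Box-and-Whisker Plots: These plots show the median, quartiles, and extremes of the data, which can highlight disparities within and between countries’ income distributions. A box-and-whisker plot of income for each country would show how spread out incomes are around the median, provided that income data below the country level is available. The gapminder data holds only one GDP per capita value per country and year, so with these data alone the boxes would instead summarize the countries within each continent.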

Histograms and Density Plots: These can show the distribution of a single metric, like GDP per capita, across different population segments within a country. They could also be used to compare the distribution of wealth across countries.

Violin Plots: These plots combine the features of box plots and density plots, showing the distribution of a metric across different countries.
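
As an illustration of the methods listed above, the sketch below overlays box plots on violin plots of GDP per capita by continent for a single year. It uses one value per country because that is all the gapminder data provides. The choice of 2007, the log scale, and the removal of Oceania (only two countries in these data) are assumptions made for the example.

```r
library(gapminder)
library(ggplot2)

# One year of data; Oceania is dropped because it has too few countries
# for a density estimate to be meaningful.
g2007 <- subset(gapminder, year == 2007 & continent != "Oceania")

p_ineq <- ggplot(g2007, aes(x = continent, y = gdpPercap)) +
  geom_violin(fill = "grey90") +                  # shape of the distribution
  geom_boxplot(width = 0.15, outlier.size = 1) +  # median and quartiles
  scale_y_log10() +                               # incomes are strongly right-skewed
  labs(x = "Continent", y = "GDP per capita (log scale)",
       title = "Between-country spread of GDP per capita, 2007")

p_ineq
```

A wide violin or a tall box for a continent indicates large differences between its countries. It says nothing about inequality inside any one country.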